Abstract. We present a general setting for structure-sequence comparison in a large class 
of RNA structures that unifies and generalizes a number of recent works on specific families 
of structures. Our approach is based on tree decomposition of structures and gives rise to a 
general parameterized algorithm, where the exponential part of the complexity depends on the 
family of structures. For each of the previously studied families, our algorithm has the same 
complexity as the specific algorithm that had been given before. 

1 Introduction 

The RNA structure-sequence comparison problem arises in two main kinds of applications: searching 
for a given structured RNA in a long sequence or a set of sequences, and three-dimensional modeling 
by homology. In [6], Jiang et al. addressed the problem of pairwise comparison of RNA structures in 
its full generality. They defined the edit distance problem on RNA structures represented as graphs, 
using a set of atomic edit operations. Notably, they gave a dynamic programming algorithm in 
$O(nm^3)$ time for comparing a sequence to a nested structure, where $n$ is the length of the 
sequence with known structure and $m$ the length of the sequence of unknown structure. They also 
established that computing the edit distance between a sequence and a structure is a Max-SNP-hard 
problem if the structure contains pseudoknots. 

Meanwhile, considering all the known interactions in RNAs, including non-canonical ones [5] 
and pseudoknots, is crucial for precise structure-sequence alignment. In [6], a polynomial algorithm 
was developed for pseudoknotted structures, but it involves constraints on the costs of the edit 
operations. It has been observed that the so-called H-type and kissing-hairpin pseudoknots represent 
more than 80% of the pseudoknots in known structures [9]. If the structure contains only H-type 
pseudoknots, the alignment problem can be solved in $O(nm^3)$ [5,11]. Moreover, in [5] these two 
classes of pseudoknots were embedded into a more general class, the standard pseudoknots. Two 
algorithms, in $O(nm^k)$ and $O(nm^{k+1})$ respectively for a single standard pseudoknot and for a 
standard pseudoknot embedded inside a nested structure, were developed, where $k$ is the so-called 
degree of the pseudoknot. Recently, the more general class of simple non-standard pseudoknots was 
defined, and an algorithm was given in $O(nm^{k+1})$ if alone, or in $O(nm^{k+2})$ for a simple recursion 

In the present paper, we give a general setting for sequence-structure comparison in a large class 
of RNA structures that unifies and generalizes all the above families of structures. Notably, we handle 
structures where every nucleotide can be paired to any number of other nucleotides, thus considering 
all kinds of non-canonical interactions [8]. Our approach is based on tree decomposition of structures 
and gives rise to a general parameterized algorithm, where the exponential part of the complexity 
depends on the family of structures. For each of the previously studied families, our algorithm has 
the same complexity as the specific algorithm that had been given before. Table 1 gives a summary of 
the previous works that are generalized by our approach, and the time complexity of our algorithm 
for each of the classes. 

Table 1. Summary of existing algorithms for structure-sequence alignment. Our approach unifies all these 
algorithms and, for each class of structures captured by pre-existing works, specializes into time complexities 
that match previous efforts. The tree structure represents an inclusion relation; hence the root class RCS 
includes all other classes. Notation: $k$ is the degree of the pseudoknot/(simple) standard structure. 



2 Sequence-Structure Alignment 



First, let us state some definitions, starting with the concept of arc-annotated sequence, which will 
be used as an abstract representation for RNA structure. 

Definition 1 (Arc-annotated sequence). An arc-annotated sequence is a pair $(S,P)$, where $S$ 
is a sequence over an alphabet $\Sigma$ and $P$ is a set of unordered pairs of positions in $S$. 

For RNA structures, $S$ is the nucleotide sequence and $P$ the set of the interactions over 
$S$, as illustrated by Figure 1 (upper part). Here $\Sigma = \{A, U, G, C\}$, and any $(i,j) \in P$ represents an 
interaction between the nucleotides at positions $i$ and $j$. The nucleotides are numbered from 1 to $n$ 
(where $n$ is the sequence length), follow a 5' to 3' order, and $S[i]$ denotes the nucleotide number $i$. 
One should notice that, unlike most definitions, a position $i$ can be involved in multiple interactions, 
allowing for the representation of tertiary structures. In the following, we sometimes refer to such an 
arc-annotated sequence as an RNA graph interaction structure or, in short, an RNA graph. 

There exist several equivalent ways to define a structure-sequence alignment, that is, an alignment 
between an arc-annotated sequence and a (plain) sequence. We choose to abstract an alignment as a 
mapping, i.e. we represent an alignment as a partial mapping between positions in the arc-annotated 
sequence and positions in the plain sequence, as shown in Figure 1. 

Definition 2 (Structure-Sequence Alignment). A structure-sequence alignment between an 
arc-annotated sequence $A = (S_A, P_A)$ of size $n$ and a plain sequence $B = (S_B, \emptyset)$ of size $m$ is a partial 
mapping $\mu$ from $[1,n]$ to $[1, m+1]$ such that: 

- $\mu$ is injective. 

- $\mu$ preserves the order: $\mu(i) < \mu(j) \Rightarrow i < j$. 

We write $\mathcal{F}(A, B)$ (or $\mathcal{F}$ when there is no ambiguity) for the set of all possible alignments between 
$A$ and $B$. 
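
As an illustration of Definition 2, the two conditions can be checked directly on a candidate mapping. The sketch below is not part of the original algorithm: it assumes, hypothetically, that a partial mapping is stored as a Python dictionary from positions of A to positions of B (unmatched positions are simply absent), and the function name `is_alignment` is ours. It relies on the observation that injectivity together with order preservation is equivalent to the images being strictly increasing along the matched positions.

```python
from typing import Dict


def is_alignment(mu: Dict[int, int], n: int, m: int) -> bool:
    """Check the two conditions of Definition 2 on a partial mapping.

    mu maps matched positions of A (1..n) to positions of B; following the
    definition as printed, images are allowed in [1, m + 1].
    """
    # Every matched position and every image must lie in the allowed ranges.
    if any(not (1 <= i <= n and 1 <= b <= m + 1) for i, b in mu.items()):
        return False
    # Images listed in increasing order of the matched positions of A.
    images = [mu[i] for i in sorted(mu)]
    # Strictly increasing images <=> injective and order-preserving.
    return all(x < y for x, y in zip(images, images[1:]))
```

For instance, `{1: 2, 3: 4}` is accepted for n = 4 and m = 5, whereas `{1: 3, 2: 2}` is rejected because it reverses the order of positions 1 and 2.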

Note that some positions in the arc-annotated sequence may be without a corresponding 
position in the plain sequence. We write $\mu(i) = \bot$ if position $i$ does not have an image by $\mu$, and 
qualify it as unmatched. Consecutive sequences of unmatched positions in the structure ($\mu(i) = \bot$) 
or in the sequence ($\mu^{-1}(i) = \bot$) are usually grouped and scored together. A (composite) gap is 
then a maximal set of consecutive positions $(i, \ldots, j)$ that are either unmatched or do not have an 
antecedent by $\mu$. The length $|g|$ of a gap $g$ is simply the number of positions it contains. By grouping 
unmatched positions within gaps, one can handle affine penalty functions in the cost function. 

Fig. 1. A representation of a partial mapping between an 
arc-annotated sequence (upper part) and a (plain) sequence 
(lower part). 

Let us now define the cost of an alignment, which captures the level of similarity between two 
RNAs, and needs to account for both structure and sequence elements. 

Definition 3 (Cost of a Sequence-Structure Alignment). The cost of a structure-sequence 
alignment $\mu$ between an arc-annotated sequence $A = (S_A, P_A)$ of size $n$ and a plain sequence 
$B = (S_B, \emptyset)$ is defined by: 

$$\mathrm{Cost}(\mu) = \sum_{\substack{i \in [1,n] \\ \mu(i) \neq \bot}} \gamma(i, \mu(i)) + \sum_{\text{gap } g \subset A} \lambda_A(|g|) + \sum_{\text{gap } g \subset B} \lambda_B(|g|) + \sum_{(i,j) \in P_A} \varphi(i, j, \mu(i), \mu(j)) \qquad (1)$$

where 

- $\gamma(i, \mu(i))$ is the cost of a base substitution between position $i$ in $A$ and position $\mu(i)$ in $B$. 

- $\lambda_Y(x) = \alpha_Y \cdot x + \beta_Y$ is the affine cost penalty for a gap of length $x$ in a sequence $Y$. 

- $\varphi(i, j, \mu(i), \mu(j))$ is the cost of an arc removal ($\mu(i) = \mu(j) = \bot$), arc altering ($\mu(i) = \bot$ or $\mu(j) = \bot$), 
or arc substitution involving paired positions $i$ and $j$ in $A$. 

Note that, unlike the score functions defined in [5,11], our formulation captures arc-alterations 
and arc-breakings as atomic operations on their interactions (as in [6]), allowing for general cost 
schemes. 
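
To make the four terms of the cost concrete, here is a minimal sketch that evaluates the cost of a given alignment. It is an illustration under stated assumptions rather than the authors' implementation: the alignment is a dictionary from matched positions of A to positions of B in [1, m], gaps in A and in B are scored separately as maximal runs of unmatched positions, and the cost functions `gamma` and `phi` are hypothetical callables supplied by the caller (`phi` receives `None` for an unmatched end).

```python
from typing import Callable, Dict, Iterable, List, Optional, Tuple


def run_lengths(flags: Iterable[bool]) -> List[int]:
    """Lengths of the maximal runs of True values in flags."""
    lengths, current = [], 0
    for flag in flags:
        if flag:
            current += 1
        elif current:
            lengths.append(current)
            current = 0
    if current:
        lengths.append(current)
    return lengths


def alignment_cost(
    mu: Dict[int, int],
    n: int,
    m: int,
    arcs: Iterable[Tuple[int, int]],
    gamma: Callable[[int, int], float],
    phi: Callable[[int, int, Optional[int], Optional[int]], float],
    alpha_a: float,
    beta_a: float,
    alpha_b: float,
    beta_b: float,
) -> float:
    """Evaluate the four sums of the alignment cost for a partial mapping mu."""
    matched_b = set(mu.values())
    # Base substitutions over matched positions of A.
    bases = sum(gamma(i, b) for i, b in mu.items())
    # Affine penalties: one opening cost (beta) per gap, alpha per position.
    gaps_a = run_lengths(i not in mu for i in range(1, n + 1))
    gaps_b = run_lengths(b not in matched_b for b in range(1, m + 1))
    gaps = sum(alpha_a * g + beta_a for g in gaps_a)
    gaps += sum(alpha_b * g + beta_b for g in gaps_b)
    # Arc removal, alteration or substitution, decided inside phi.
    interactions = sum(phi(i, j, mu.get(i), mu.get(j)) for i, j in arcs)
    return bases + gaps + interactions
```

With unit extension costs and opening costs of 2, aligning positions 1 and 4 of a four-nucleotide structure to positions 1 and 3 of a three-nucleotide sequence leaves one gap of length 2 in A and one gap of length 1 in B, for a total gap penalty of 4 + 3 = 7.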

Definition 4 (Structure-Sequence Alignment Problem). Given an arc-annotated sequence $A$ 
and a plain sequence $B = (S_B, \emptyset)$, the structure-sequence alignment problem is to find an alignment 
between $A$ and $B$ having the minimum cost. 

As stated in |12l2j the problem is already NP-Hard if we consider Crossing interactions (i.e. 
pseudoknots, without multiple pairings). In the following, we will use a parameterized approach to 
handle general structures including unrestricted crossing interactions and multiple interactions per 
position, assuming that the total number of interactions per position is bounded by a constant. 



3 Tree Decomposition and Alignment Algorithm 

As previously sketched, our alignment algorithm relies on tree decomposition of an arc-annotated 
sequence. Tree decompositions are usually defined on graphs rather than on arc-annotated sequences. 
Here we give a straightforward adaptation that preserves all the properties of the standard tree 
decompositions. 



3.1 Definitions 

Definition 5 (Tree Decomposition of an arc-annotated sequence). Given an arc-annotated 
sequence $A = (S, P)$, a tree decomposition of $A$ is a pair $(X, T)$ where $X = \{X_1, \ldots, X_N\}$ is a family 
of subsets of positions $\{i \in [1,n]\}$, $n = \mathrm{length}(S)$, and $T$ is a tree whose nodes are the subsets $X_l$ 
(called bags), satisfying the following properties: 

- Each position belongs to a bag: $\bigcup_{l \in [1,N]} X_l = [1,n]$. 

- Both ends of an interaction are present in a bag: $\forall (i,j) \in P, \exists l \in [1,N], \{i,j\} \subseteq X_l$. 

- Consecutive positions are both present in a bag: $\forall i \in [1, n-1], \exists l \in [1,N], \{i, i+1\} \subseteq X_l$. 

- For every $X_l$ and $X_s$, $l, s \in [1,N]$, $X_l \cap X_s \subseteq X_r$ for all $X_r$ on the path between $X_l$ and $X_s$. 

Fig. 2. An arc-annotated sequence and an associated tree decomposition. Each bag 
contains three positions, so the tree width of this tree decomposition is 2. As there is only 
one bag which does not respect the smooth property (the one with two children), the 
tree decomposition is 1-weakly-smooth. 

Figure 2 illustrates the tree decomposition of an arc-annotated sequence. 



Definition 6 (Treewidth). The width of a tree decomposition $(X,T)$ is the size of its largest set 
$X_l$ minus one. The treewidth $tw(A)$ of an arc-annotated sequence $A$ is the minimum width among 
all possible tree decompositions of $A$. 
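
The properties of a tree decomposition and its width can be verified mechanically, which may help when experimenting with hand-built decompositions. The following sketch is only an illustration: it assumes, hypothetically, that bags are given as a dictionary from bag identifiers to sets of positions and that `tree_edges` already forms a tree over those identifiers. The last property is tested in its usual equivalent form, namely that the bags containing any given position induce a connected subtree.

```python
from typing import Dict, Hashable, Iterable, List, Set, Tuple


def is_tree_decomposition(
    n: int,
    arcs: Iterable[Tuple[int, int]],
    bags: Dict[Hashable, Set[int]],
    tree_edges: List[Tuple[Hashable, Hashable]],
) -> bool:
    """Check the four properties of a tree decomposition (tree_edges is assumed to be a tree)."""
    # Property 1: every position belongs to some bag.
    if set().union(*bags.values()) != set(range(1, n + 1)):
        return False
    # Properties 2 and 3: each arc and each pair (i, i+1) lies within one bag.
    pairs = list(arcs) + [(i, i + 1) for i in range(1, n)]
    for i, j in pairs:
        if not any(i in bag and j in bag for bag in bags.values()):
            return False
    # Property 4: bags holding a given position form a connected subtree.
    adjacent = {l: set() for l in bags}
    for a, b in tree_edges:
        adjacent[a].add(b)
        adjacent[b].add(a)
    for p in range(1, n + 1):
        holders = {l for l, bag in bags.items() if p in bag}
        start = next(iter(holders))
        seen, stack = {start}, [start]
        while stack:
            u = stack.pop()
            for v in adjacent[u] & holders:
                if v not in seen:
                    seen.add(v)
                    stack.append(v)
        if seen != holders:
            return False
    return True


def width(bags: Dict[Hashable, Set[int]]) -> int:
    """Width of a decomposition: size of its largest bag minus one."""
    return max(len(bag) for bag in bags.values()) - 1
```

For a sequence of four positions with a single arc (1, 4), the two bags {1, 2, 4} and {2, 3, 4} joined by one edge form a valid decomposition of width 2.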

In general, tree decompositions are not rooted. Nevertheless, for the sake of clarity, we will 
arbitrarily choose a root $X_0$ in our decompositions. Additionally, we use the following notation: for any 
two bags $X_l$ and $X_r$ in $X$, we write $X_{l,r}$ as shorthand for $X_l \cap X_r$. 

The process of assigning consecutive positions can be simplified by the affine gap penalties, using 
the general idea of Gotoh's algorithm [4]. Let us then define the notion of smooth bag, which will help 
us take advantage of this optimization. 

Definition 7 (Smooth Bag of a Tree Decomposition). Let $X_l \in X$ be a bag in a tree 
decomposition $(X,T)$ for an arc-annotated sequence $A = (S,P)$. If $X_l \neq X_0$, then let $X_r$ be its father. $X_l$ 
is then smooth iff there exist two consecutive positions $i$ and $j$ such that $i \in X_l - X_r$, $j \in X_{l,r}$, $j$ 
is not in any child of $X_l$, and there is no $i' \in X_l - X_r$ such that $(i',j) \in P$ or $i'$ (except $i$) is 
consecutive to $j$. The root $X_0$ is smooth iff (1) there exist two consecutive positions $i,j \in X_0$ such 
that $j$ is not in any child of $X_0$ and $(i,j) \notin P$, or (2) the size of the root is strictly smaller than 
the size of one of its children. 

Definition 8 ((Weakly-) Smooth Tree Decomposition). A tree decomposition $(X,T)$ for an 
arc-annotated sequence $A = (S,P)$ is smooth iff every bag $X_l \in X$ is smooth. 

A tree decomposition $(X,T)$ is $k$-weakly smooth iff at most $k$ of its bags are not smooth. 

As stated in [3], a tree decomposition can always be converted into a binary tree in linear time. 
Moreover, this transformation can be done without breaking the smoothness of the tree 
decomposition. Therefore, we will limit, without loss of generality, the scope of our algorithm and analysis to 
binary tree decompositions. Finally, if the tree decomposition is smooth and the root $X_0$ is smooth by 
condition (1), then the tree can be made smooth by condition (2) by creating a new root composed of 
the positions of the old one except the position $i$ (with $i$ as in the definition above) and adding it as 
the father of $X_0$. In the following, we will only consider case (2) for the alignment algorithms. 



3.2 An Alignment Algorithm based on Tree Decomposition 

Let us describe an algorithm that computes the minimum cost alignment of an arc-annotated 
sequence $A$ and a (flat) sequence $B$ ($P_B = \emptyset$). This algorithm implements a dynamic programming 
strategy, based on a $k$-weakly smooth tree decomposition of $A$. 

An alternative internal representation for alignments. The recursive step in our scheme 
consists in extending a partial alignment, assigning positions that are proper to the bag (i.e. not 
present in the bag's father). One of the main challenges is to preserve the sequential order. Indeed, 
only consecutive positions $(i, i+1)$ are guaranteed to be simultaneously present in a bag, allowing 
for a direct control over the sequential ordering ($\mu(i) < \mu(i+1)$) at the time of an assignment. 
Intuitively, this property may extend transitively over $\mu$, since $\mu(i) < \mu(i+1)$ and $\mu(i+1) < \mu(i+2)$ 
imply $\mu(i) < \mu(i+2)$. However, this property no longer holds when a position is unmatched, as 
$\mu(i+1)$ is then undefined and cannot serve as a reference point for the relative positioning of $\mu(i)$ 
and $\mu(i+2)$. 

Fig. 3. Internal representation of alignments as a pair $f = (\mu, \delta)$, 
used within our dynamic programming algorithm. A function $\mu$ 
defines a complete mapping, while $\delta$ discriminates matched positions 
(solid stroke) from unmatched ones (dashed stroke). Additionally, 
unmatched positions are forced to aggregate to their nearest matched 
neighbor to their right, and we create an additional virtual position 
(16 here) to provide an image to trailing unmatched positions in the 
structure. 

To work around this issue, we identify, in the description of our algorithm, an alignment with a 
pair $f = (\mu, \delta)$, where $\mu : [1, n] \to [1, m+1]$ is a full ordered mapping ($\mu(\cdot) \neq \bot$) and $\delta : [1, n] \to \{0, 1\}$ 
distinguishes between matched ($\delta(i) = 1$) and unmatched positions ($\delta(i) = 0$). As illustrated by 
Figure 3, this new representation aggregates consecutive unmatched positions in $A$ to their nearest 
rightward position that is matched in the alignment. A virtual position $m + 1$ is then added to $B$ to 
serve as an image for the unmatched positions appearing at the end of $A$. 
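
The conversion from a partial mapping to this internal representation can be sketched in a few lines. This is an illustrative reading of the paragraph above, not code from the paper: the partial mapping is assumed to be a dictionary over matched positions only, and the function name `to_full_mapping` is hypothetical. A single right-to-left scan suffices, since each unmatched position inherits the image of the nearest matched position to its right, or the virtual position m + 1 when there is none.

```python
from typing import Dict, Tuple


def to_full_mapping(
    partial: Dict[int, int], n: int, m: int
) -> Tuple[Dict[int, int], Dict[int, int]]:
    """Turn a partial mapping into the pair (mu, delta) described in the text."""
    mu: Dict[int, int] = {}
    delta: Dict[int, int] = {}
    next_image = m + 1  # virtual position for trailing unmatched positions
    for i in range(n, 0, -1):  # scan A from right to left
        if i in partial:
            next_image = partial[i]
            delta[i] = 1  # matched
        else:
            delta[i] = 0  # unmatched: aggregated to the right neighbor's image
        mu[i] = next_image
    return mu, delta
```

For example, with n = 5, m = 3 and the partial mapping {2: 1, 4: 3}, positions 1 and 3 take the images of positions 2 and 4 respectively, and position 5 is sent to the virtual position 4.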

Dynamic programming recursion. Our dynamic programming algorithm assigns positions in $B$ 
to the elements of a bag, proceeding to a recursive call that assigns the positions found further down 
in the tree decomposition. This requires a few additional notions and definitions. 

Let $(S, P)$ be an arc-annotated sequence, and $S'$ be a subset of $S$. The arc-annotated 
subsequence induced by $S'$ is the pair $(S',P')$ such that $P'$ is the subset of arcs in $P$ whose both 
extremities are in $S'$. Let $X_l$ be a bag in the tree decomposition of $A$. The descending subsequence 
of $X_l$ is the subset of positions appearing in the descendants of $X_l$ in the tree. The descending 
arc-annotated subsequence of $X_l$ is the arc-annotated subsequence induced by its descending 
subsequence. The notion of alignment between $A$ and $B$ naturally extends to alignments involving an 
arc-annotated subsequence of $A$, and we denote by $\mathcal{F}|_{S'}$ the set of all possible alignments between 
the arc-annotated subsequence induced by $S'$, and $B$. 

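Let $X_l$ be a bag having father $X_r$ ($X_r := X_l$ when $l = 0$), and let $f \in \mathcal{F}|_{X_{l,r}}$ be an alignment for the 
common positions of $X_l$ and $X_r$ to $B$. Let us denote by $C^l_f$ the cost of the best alignment between 
the descending arc-annotated subsequence of $X_l$ and $B$, which matches $f$ on $X_{l,r}$. It can be shown 
that $C^l_f$ obeys 

$$C^l_f = \min_{\substack{f' \in \mathcal{F}|_{X_l} \\ f'(i) = f(i),\ \forall i \in X_{l,r}}} \Big\{ \mathrm{LCost}(X_l, f') + \sum_{s \in \mathrm{sons}(l)} C^s_{f'} \Big\}. \qquad (2)$$

Moreover, the local contribution $\mathrm{LCost}(X_l, f)$ of a bag $X_l$ to an alignment $f = (\mu, \delta)$ is 

$$\begin{aligned}
\mathrm{LCost}(X_l, f) ={}& \sum_{\substack{i \in X_l - X_r \\ \delta(i) = 1}} \gamma(i, \mu(i)) + \sum_{\substack{i,j \in X_l \text{ s.t.} \\ i \text{ or } j \in X_l - X_r \\ \text{and } (i,j) \in P_A}} \varphi(i, j, f(i), f(j)) && \text{(bases and interactions)} \\
&+ \sum_{\substack{i,i+1 \in X_l \text{ s.t.} \\ i \text{ or } i+1 \in X_l - X_r \\ \text{and } \mu(i+1) > \mu(i)}} \big[ \alpha_B \cdot (\mu(i+1) - \mu(i)) + \beta_B \big] && \text{(gaps in sequence } B\text{)} \\
&+ \sum_{\substack{i,i+1 \in X_l \text{ s.t.} \\ i \text{ or } i+1 \in X_l - X_r \\ \delta(i) = 1 \text{ and } \delta(i+1) = 0}} \beta_A + \sum_{\substack{i \in X_l - X_r \\ \text{s.t. } \delta(i) = 0}} \alpha_A && \text{(gaps in sequence } A\text{)}
\end{aligned}$$

assuming a gentle abuse of notation, in which $f(i) = \mu(i)$ if $\delta(i) = 1$, or $\bot$ otherwise. 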

A dynamic programming algorithm follows from this general recurrence equation, in a standard 
way. The cost of the best alignment is given by $\min\{C^0_f \mid f \in \mathcal{F}|_{X_0}\}$. A simple backtrack procedure 
gives the best alignment between $A$ and $B$. The following theorem gives the worst-case complexity 
of the algorithm. 

Theorem 1. Let $A$ and $B$ be two arc-annotated sequences ($P_B = \emptyset$), and let $(X,T)$ be a tree 
decomposition of $A$. The structure-sequence alignment of $A$ and $B$ can be computed in $O(N \cdot m^{t+1})$ 
time and $O(N \cdot m^t)$ space, where $N = |X|$, $t$ is the tree-width of $(X,T)$, and $m = |B|$. 

This complexity can be further improved when the tree decomposition is smooth (or even only 
$k$-weakly smooth), by taking advantage of the affine nature of gap penalty functions. To that purpose, 
one uses the general principle underlying Gotoh's algorithm [4], and introduces a secondary matrix 
to distinguish gap openings from gap extensions. We obtain the dynamic programming equation 
summarized in Figure 4. 

Theorem 2. If the tree decomposition of $A$ is $k$-weakly smooth, then the sequence-structure 
alignment of $A$ and $B$ can be computed in $O(k \cdot m^{t+1} + (N - k) \cdot m^t)$ time. 

Corollary 1. Let $A$ and $B$ be as before. If the tree decomposition $(X,T)$ of $A$ is smooth and has 
width $t$, then the time complexity of the structure-sequence alignment algorithm is in $O(N \cdot m^t)$. 

4 Tree Decomposition and Sequence Structure Alignment of RNA 
Structures 

In its full generality, the problem of computing a tree decomposition of minimum width for an 
arc-annotated sequence is NP-Hard pQ. However, by restricting the problem to some specific RNA 
structure families, one can obtain a tree decomposition with a small width in reasonable time. The 
key idea relies on a total ordering of the positions in the arc-annotated sequence, as shown in the 
following. For the sake of clarity, we suppose at first that there is no unpaired position in the structure. 

Definition 9. A wave embedding $W$ of an arc-annotated sequence $A = (S, P)$ is defined by an 
increasing sequence of pivot positions $y = \{y_i\}_{i=0}^{k}$, such that $y_0 := 1$ and $y_k := n$ (Figure 5). The 
degree of a wave embedding is its number of pivots minus one. 

Now, a total ordering on the interactions can be inferred from a wave embedding, in which case the 
wave embedding is said to be ordering. Let us now give a sufficient condition for a given embedding 
to be ordering. To that purpose, we introduce the upward graph, and show that its acyclicity is a 
sufficient condition for the embedding to be ordering. 

Fig. 4. Dynamic programming equation for aligning a smooth bag $X_l$, whose father $X_r$ has previously been 
assigned. Cases 1 and 2 above apply respectively to smooth bags such that $i - 1 \in X_r$ or $i + 1 \in X_r$ (note 
that these two cases are mutually exclusive from the definition of smoothness). 

Given a wave embedding $W$ of an arc-annotated sequence $A = (S, P)$, we call intervals of $W$ 
the half-open intervals $I_t = [y_t, y_{t+1}[$ for $t \in [1, k-2]$ and the interval $I_k = [y_{k-1}, y_k]$. Now, let us 
start by defining a partial order $\prec$ on the positions of a single interval, as follows: for any $i, j \in I_t$, 
one has $i \prec j$ if either $i < j$ and $t$ is odd, or $i > j$ and $t$ is even. This relation can be used to define 
the position directly below $i$, i.e. the closest position $i^-$ to $i$ such that $i^- \prec i$. In other words, one 
has $i^- = i - 1$ if $i$ and $i - 1$ belong to the same odd interval $I_t$, or $i^- = i + 1$ if $i$ and $i + 1$ belong to the same 
even interval. In the absence of such a position, we set $i^- = \emptyset$. The highest position of an interval $I_t$ 
is the position $i \in I_t$ such that there is no position $j$ in $I_t$ with $i \prec j$. Now we can define the upward 
graph: 

Definition 10. Given a wave embedding $W$ of an arc-annotated sequence $A = (S, P)$, the upward 
graph of $A$ associated to $W$ is the directed graph $G = (V_G, A_G)$ such that $V_G = P$, and $A_G$ is the 
set of arcs $((i, j) \to (i', j'))$ such that $i$ or $j$ is directly below $i'$ or $j'$ in $W$. 

A wave embedding of an arc-annotated sequence $A = (S, P)$ is ordering if its upward graph is 
acyclic. 

Algorithm 1 takes as input an upward graph, supposed to be acyclic, and assigns a level to each 
vertex, and thus a level to each interaction of the associated arc-annotated sequence. This algorithm is 
a straightforward modification of Kahn's topological ordering algorithm [7], illustrated in Figure 6. 

Now we define the level of any position as the minimum level of all interactions in which the 
position is involved, as illustrated in Figure 6. 




Fig. 5. A: An arc-annotated sequence with its pivots. B: A wave embedding representation of the same 
arc-annotated sequence. C: The associated (acyclic) upward graph (interactions are ranked by their level). 



Algorithm 1: Level Algorithm 

Input : a directed acyclic graph $G = (V_G, A_G)$ 
Output: a level for each vertex 

- $L = \{v \in V_G,\ d^-(v) = 0\}$ 
- For each $v \in L$, $level(v) = 1$ 
- While $L \neq \emptyset$: 
  • pop front $v$ from $L$. 
  • For each $v'$ such that $(v,v') \in A_G$: 
    * remove $(v,v')$ from $A_G$ 
    * If $d^-(v') = 0$: 
      ■ push back $v'$ into $L$ 
      ■ $level(v') = level(v) + 1$ 
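
The level algorithm translates almost line by line into executable code. The sketch below is one possible rendering, with the hypothetical function name `assign_levels`; instead of deleting arcs from the graph it decrements in-degree counters, which has the same effect as the removal step, and it raises an error if the input turns out not to be acyclic (a check that the pseudocode leaves implicit).

```python
from collections import deque
from typing import Dict, Hashable, List, Tuple


def assign_levels(
    vertices: List[Hashable], edges: List[Tuple[Hashable, Hashable]]
) -> Dict[Hashable, int]:
    """Assign a level to each vertex of a DAG with a Kahn-style traversal."""
    successors = {v: [] for v in vertices}
    in_degree = {v: 0 for v in vertices}
    for u, v in edges:
        successors[u].append(v)
        in_degree[v] += 1
    # L initially holds the vertices without incoming arcs, at level 1.
    queue = deque(v for v in vertices if in_degree[v] == 0)
    level = {v: 1 for v in queue}
    while queue:
        v = queue.popleft()  # pop front
        for w in successors[v]:
            in_degree[w] -= 1  # equivalent to removing the arc (v, w)
            if in_degree[w] == 0:
                level[w] = level[v] + 1
                queue.append(w)  # push back
    if len(level) != len(in_degree):
        raise ValueError("the input graph contains a cycle")
    return level
```

In the upward graph, the vertices would be the interactions themselves (for instance as tuples of positions); any hashable labels work in this sketch.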



Definition 11 (Level of a position). Given an ordering wave embedding $W$ of an arc-annotated 
sequence $A = (S, P)$, the level of a position $i$ is defined by: $level(i) = \min_{(i,j) \in P} level((i,j))$. We define 
a total order $\triangleleft$ on $S$ through: $i \triangleleft j$ iff either $level(i) < level(j)$, or $level(i) = level(j)$ and $i < j$. 
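
Definition 11 can be illustrated with a short sketch that derives position levels from interaction levels and then sorts positions according to the total order. The names `position_levels` and `total_order` are hypothetical, and the input is assumed to be a dictionary from interactions (pairs of positions) to their levels, such as the output of a level-assignment step. As in the text, every position is assumed to be involved in at least one interaction.

```python
from typing import Dict, List, Tuple


def position_levels(arc_level: Dict[Tuple[int, int], int]) -> Dict[int, int]:
    """Level of a position: minimum level over the interactions involving it."""
    level: Dict[int, int] = {}
    for (i, j), lv in arc_level.items():
        for p in (i, j):
            level[p] = min(lv, level.get(p, lv))
    return level


def total_order(level: Dict[int, int]) -> List[int]:
    """Positions sorted by level first, then by index, as in Definition 11."""
    return sorted(level, key=lambda p: (level[p], p))
```

For instance, if position 2 takes part in an interaction of level 2 and in another of level 1, its level is 1, so it is ranked among the level-1 positions by increasing index.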

Then, we introduce Algorithm 2 which, starting from an ordering wave embedding, decomposes 
any arc-annotated sequence. The key idea is to create a root which contains the highest position in 
each interval. The successor of a bag is then obtained by changing the highest level position into the 
position directly below it (Figure 5). 

Theorem 3. Given an ordering wave embedding of degree $k$ for an arc-annotated sequence $A$, 
a tree decomposition of $A$ having width $k$ can always be computed in time $O(k \cdot n)$. 

Corollary 2. Let $A$ and $B$ be two arc-annotated sequences with $P_B = \emptyset$. Given an ordering wave 
embedding of degree $k$ of $A$, the structure-sequence alignment of $A$ and $B$ can be computed in $O(n \cdot m^k)$. 

5 Application to three general classes of structures 

In this section, we define three new structure classes, which respectively generalize the standard 
pseudoknots [5], the simple non-standard pseudoknots [11] and the standard triple helices [10]. For 
each of them, a natural ordering wave embedding can be found, such that our general alignment 
algorithm has the same complexity as the previously introduced ad hoc alternatives. 



Algorithm 2: Chaining Algorithm 

Input : an arc-annotated sequence $A$ and an ordering wave embedding of degree $k$ 
Output: A tree decomposition of $A$ 

- Assign a level to each interaction using Algorithm 1 and map levels to positions 
- $X = \emptyset$ and $T$ is an empty tree 
- Create a node $X_0$ composed of the highest position of each interval and set $X_0$ as the root of $T$ 
- $l = 0$ 
- While there is a position $i \in X_l$ such that $i^- \neq \emptyset$: 
  • Search the position $p \in X_l$ with the highest level and such that $p^- \neq \emptyset$ 
  • Add $p^-$ to $X_l$ 
  • Add $X_l$ to $X$ 
  • If $l > 0$, set $X_l$ as the son of $X_{l-1}$ 
  • Set $l = l + 1$ and $X_l = X_{l-1} - \{p\}$ 
- Return $(X, T)$ 




Fig. 6. A: Arc-annotated sequence and its pivots. B: Ranking of the positions by decreasing level. C: 
Tree decomposition obtained with Algorithm 2. The highlighted position in each bag is the position denoted 
as position $p$ within Algorithm 2. The last position in each bag is the position $p^-$. 



5.1 Standard Structures 

Here we define and describe the alignment of standard structures, a natural generalization of the 
standard pseudoknots defined by Han et al. [5]. The main specificity of this class is that bases can 
interact with several other bases. This allows the consideration of multiple non-canonical interactions 
(e.g. base triples) in RNA structures (see Figure 7A). 

Definition 12 (Standard Structure). An arc-annotated sequence $A = (S,P)$ is a standard 
structure if there exists an ordering wave embedding, based on a pivot list $y = \{y_i\}_{i=0}^{k}$, $k > 1$, such 
that the extremities of any interaction $(i,j) \in P$ are separated by exactly one pivot. 

The ordering wave embedding can then be used by Algorithm 2 to yield a smooth tree 
decomposition of width $k$; therefore the complexity of the structure-sequence alignment is $O(n \cdot m^k)$. 



5.2 Simple Non-Standard Structures 

In [11], the algorithm of [5] is extended to capture the so-called simple non-standard pseudoknots. 
Briefly, a simple non-standard pseudoknot contains a standard pseudoknot, and defines a special 
region from which interactions may initiate, possibly crossing interactions in the standard pseudoknot. 
We extend this class in order to capture multiple interactions, as illustrated by Figure 7B. 

representation. B: A simple non-standard structure and one of its ordering wave embedding 
representations ($k' = 2$). C: An extended triple helix and one of its ordering wave embedding 
representations. 



Definition 13 (Simple Non-Standard Structure). An arc-annotated sequence $A = (S,P)$ is a 
simple non-standard structure (Type I) if there exists an ordering wave embedding, based on a 
pivot list $y = \{y_i\}_{i=0}^{k}$, $k > 1$ and $\tau \in [1, k - k' - 1]$, $k' \in \{1,2\}$, such that the extremities of any 
interaction $(i,j) \in P$ with $j < y_{k-k'}$ are separated by exactly one pivot and the other interactions 
$(i,j) \in P$ with $y_{k-k'} \leq j$ are such that $y_{\tau-1} \leq i < y_\tau$. 

As in [11], Type II simple non-standard structures are symmetric to Type I: the special region lies at 
the beginning of the sequence. To be coherent with the definition of simple non-standard pseudoknots 
given in [11], we define the degree of a simple non-standard structure as the number of pivots in its ordering 
wave embedding. Therefore, the treewidth of a simple non-standard structure of degree $k + 1$ is at 
most $k + 1$ and, given its pivot sequence, we can build a smooth tree decomposition of width $k + 1$. 
Hence the complexity of the structure-sequence alignment is $O(n \cdot m^{k+1})$. 

5.3 Extended Standard Triple Helices 

To our knowledge, the standard triple helices [10] constitute the first attempt to handle base triples 
in sequence/structure alignments. A standard triple helix is a kind of standard pseudoknot of degree 
3 where some positions are allowed to be involved in multiple base pairs. 

We define an extended standard triple helix as a structure admitting an ordering wave 
embedding of degree 3 (Figure 7C). This new class strictly includes standard triple helices. 
Furthermore, each such structure can be represented by a tree decomposition which is smooth and has 
width at most 3. This gives an algorithm in $O(n \cdot m^3)$ for the structure-sequence alignment. 

6 Recursive Structures 

Now we consider much more general RNA structures, where different kinds of pseudoknots can occur 
anywhere. As will be seen below, such structures can be decomposed into primitives, and from the 
tree decomposition of each primitive a global tree decomposition of the structure can be built. The 
set of primitive arc-annotated subsequences (primitives for short) of an arc-annotated 
sequence is the set of all arc-annotated subsequences induced by the connected components of its 
conflict graph, which is defined as follows. The conflict graph $G = (V, E)$ of an arc-annotated 
sequence $A = (S, P)$ is the graph such that: 

- $V = P$ (the nodes of $G$ are the interactions of $A$). 

- $(v_1, v_2) \in E$ with $v_1 = (i_1, j_1)$ and $v_2 = (i_2, j_2)$ ($i_1 < i_2$) iff $i_1 < i_2 < j_1 < j_2$ (interactions cross). 
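
As an illustration of this definition, the conflict graph and its connected components can be computed with a quadratic pairwise crossing test followed by a graph traversal. The sketch below is a simplification under stated assumptions: the function names `crosses` and `primitives` are hypothetical, interactions are given as distinct pairs of positions, and each component is returned as a set of interactions rather than as the full induced arc-annotated subsequence.

```python
from itertools import combinations
from typing import Iterable, List, Set, Tuple

Arc = Tuple[int, int]


def crosses(a: Arc, b: Arc) -> bool:
    """True iff the two interactions cross: i1 < i2 < j1 < j2."""
    (i1, j1), (i2, j2) = sorted([a, b])
    return i1 < i2 < j1 < j2


def primitives(arcs: Iterable[Arc]) -> List[Set[Arc]]:
    """Connected components of the conflict graph, as sets of interactions."""
    arcs = [tuple(sorted(a)) for a in arcs]  # store each arc as (left, right)
    adjacent = {a: set() for a in arcs}
    for a, b in combinations(arcs, 2):
        if crosses(a, b):
            adjacent[a].add(b)
            adjacent[b].add(a)
    seen: Set[Arc] = set()
    components: List[Set[Arc]] = []
    for a in arcs:
        if a in seen:
            continue
        # Depth-first traversal collecting one connected component.
        component, stack = set(), [a]
        seen.add(a)
        while stack:
            u = stack.pop()
            component.add(u)
            for v in adjacent[u] - seen:
                seen.add(v)
                stack.append(v)
        components.append(component)
    return components
```

For the interactions (1, 5), (3, 8), (6, 7) and (10, 12), only the first two cross, so they form one component while the nested and the disjoint interactions each form their own.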



Algorithm 3: Recursive Algorithm 

- Compute a tree decomposition for all primitive extensions using Algorithm 2 
- For each primitive $A_0$ of depth 0: 
  • Create a bag $\chi$ containing the right boundary of $A_0$ and the left boundary of its next primitive $A'$. 
  • Let $(X^0, T^0)$ be the tree decomposition of the extension of $A_0$ and $(X'^0, T'^0)$ the one of the 
    extension of $A'$. 
  • Add to $\chi$ the position $i$ of the root of $(X'^0, T'^0)$ such that $i - 1$ or $i + 1$ belongs to the root too. 
  • Connect the leaf of $(X^0, T^0)$ to $\chi$ and connect $\chi$ to the root of $(X'^0, T'^0)$. 
- For each possible depth $i > 0$ in increasing order: 
  • For each primitive $A$ of depth $i$: 
    * Let $(X, T)$ be the tree decomposition of the extension of $A$ and $(X', T')$ the one of the 
      extension of the arc-annotated sequence $A'$ in which $A$ is encapsulated. 
    * Find a bag $\chi$ in $(X', T')$ such that $\chi$ contains the boundaries of $A$. 
    * Connect $(X, T)$ to $(X', T')$ by connecting the root of $(X, T)$ to $\chi$. 
    * Add the right-most boundary of $A$ in $(X, T)$ to all bags from the root to the first bag 
      containing it. 



The boundaries of a primitive are its left-most and right-most positions. 

Let $A$ and $A'$ be two primitives of an arc-annotated sequence, and let $i'$ and $j'$ be the boundaries 
of $A'$ ($i' < j'$). We say that $A$ is encapsulated in $A'$ iff for any position $i \in A$, one has $i' \leq i \leq j'$, 
and there exists at least one position $j \in A$ such that $i' < j < j'$. The depth of a primitive of an 
arc-annotated sequence is the number of primitives which encapsulate it. We say that $A$ is directly 
encapsulated in $A'$ if $A$ is encapsulated in $A'$ and $depth(A) = depth(A') + 1$. 

The extension of a primitive $A$ of depth $i$ is the arc-annotated subsequence consisting of: the 
primitive $A$; the boundaries of any primitive that is directly encapsulated in $A$; and the unpaired 
positions that are directly encapsulated in $A$. 

Given a primitive $A_0$ of depth 0 of an arc-annotated sequence $A$, the next primitive of $A_0$ is 
the following primitive of level 0 in the sequential order (note that they can share a boundary). 

Theorem 4. Let $A$ be an arc-annotated sequence of size $n$. If there exists an ordering wave 
embedding of degree at most $k$ for each extension of its primitives, then the treewidth of $A$ is at most $k + 1$, 
and a $\kappa$-weakly smooth tree decomposition of $A$ can be built in $O(k \cdot n)$ time, where $\kappa$ is the number 
of primitives of odd degree whose level is greater than or equal to 1. 

Corollary 3. Let $A$ and $B$ be two arc-annotated sequences with $P_B = \emptyset$. Given an ordering wave 
embedding of degree $k$ (at most) for each extension of the primitives of $A$, the structure-sequence 
alignment of $A$ and $B$ can be computed in $O(\kappa \cdot m^{k+1} + n \cdot m^k)$ (with $\kappa$ defined in Theorem 4). 

7 Conclusion 

We have given a general parameterized dynamic programming scheme for sequence-structure 
comparison in a large class of RNA structures, which unifies and generalizes several families of structures 
that have been independently considered by previous works. Notably, we can handle structures 
where each nucleotide can be paired to any number of other nucleotides, thus capturing any type of 
non-canonical interactions. Our approach relies on a tree decomposition of arc-annotated 
sequences represented as wave embeddings, and the treewidth of the decomposition is then equal to 
the degree of the wave embedding. Computing a wave embedding of small degree is easy for all classes 
of pseudoknotted structures considered in this paper. However, the problem of finding a minimum 
degree wave embedding for any kind of pseudoknotted structure remains open. 

Fig. 8. (A) An arc-annotated sequence and (B) a representation of the tree decomposition given by Algorithm 
3. Each box corresponds to an extension of a primitive and its associated tree decomposition. Dashed links 
correspond to the connections made by Algorithm 3. The dashed bag (top of B) corresponds to the bag $\chi$ 
added to connect two consecutive primitives of level 0. The position 8 in a dashed box illustrates the case 
where the right boundary of a primitive needs to be added in its tree decomposition. 